This project dives deep into a dataset from the UCI Machine Learning Repository, which contains data on student performance in math programs at two Portuguese secondary schools. The data was collected using school reports and student questionnaires and features academic, personal, and social aspects of a student’s life.
This dataset is interesting because it shows how multiple aspects of a student’s life correlate with their grades in school. In addition, the features G1, G2, and G3 represent the grades for periods 1 to 3 of a school year (similar to how, in the United States, a midterm marks period 1 of 2 and finals mark period 2 of 2). This makes it possible to see whether a student’s grade decreases over time (falling behind gradually due to external factors) or stays consistent throughout the year. These features also allow an ML model to predict G3 from G1, G2, and the other external factors.
Our goal is to determine factors that have the most significant impact on student performance in the math program. In addition, we hope to discover factors which may seem influential based on intuition but prove otherwise after analyzing the data.
We utilized the following libraries:
Using transmute(), we converted the character columns to factors, and the columns with naturally ordered levels were made ordered factors. We used transmute() instead of mutate() so that the original columns would be dropped and no redundant data would remain. We confirmed with sum(is.na()) that the dataset has no missing values.
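The conversion described above can be sketched as follows. This is a minimal illustration on made-up rows, not the project's actual cleaning code: the raw column names (`sex`, `famsize`, `Medu`, `G3`) follow the UCI file, while the mapping to `fam_size` and `mom_edu` and the choice of which columns are ordered are assumptions.

```r
library(dplyr)

# Toy stand-in for the raw data; the values are invented for illustration.
raw <- tibble(
  sex     = c("F", "M", "F"),
  famsize = c("GT3", "LE3", "GT3"),
  Medu    = c(4, 2, 3),
  G3      = c(12, 9, 15)
)

# transmute() keeps only the columns named here, so the raw
# famsize and Medu columns are dropped rather than duplicated.
students <- raw |>
  transmute(
    sex      = factor(sex),
    fam_size = factor(famsize),
    # Education is coded 0-4, so it is treated as an ordered factor (assumption).
    mom_edu  = factor(Medu, levels = 0:4, ordered = TRUE),
    G3       = G3
  )

# Total count of missing values across the whole data frame.
n_missing <- sum(is.na(students))
```

In recent dplyr releases transmute() is marked as superseded, and `mutate(.keep = "none")` is the suggested equivalent; both drop the columns that are not explicitly created.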
The “DT” package is an R interface to the JavaScript library “DataTables”. With help from the documentation, we enabled horizontal scrolling and automatic column widths, and centered the values within each column, by passing the options as a list of lists.
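As a rough sketch of what such an options list can look like, the snippet below builds a small table with the three settings mentioned above. The data frame is a made-up stand-in, and the exact options used in the project are not shown in this write-up, so treat the values as assumptions.

```r
# Toy data frame standing in for the cleaned student data.
students <- data.frame(
  school = c("GP", "MS"),
  sex    = c("F", "M"),
  G3     = c(12, 9)
)

# DataTables options are passed as a nested list; columnDefs is a list of lists.
dt_options <- list(
  scrollX    = TRUE,  # horizontal scrolling
  autoWidth  = TRUE,  # automatic column widths
  columnDefs = list(
    list(className = "dt-center", targets = "_all")  # center every column
  )
)

# Build the widget only when DT is installed, so the sketch runs either way.
if (requireNamespace("DT", quietly = TRUE)) {
  student_table <- DT::datatable(students, options = dt_options)
}
```

Each inner list in `columnDefs` applies one set of properties to the columns named in `targets`, which is why the option takes a list of lists rather than a flat list.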
A quick way for us to explore the data was a correlation heatmap. Strong positive correlations are shown in red and strong negative correlations in blue. This plot was used as a starting point for the questions we wanted answered, as well as for basic plots that could then benefit from a third degree of comparison (factors).
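One way to build such a heatmap is sketched below: compute pairwise correlations of the numeric columns, reshape them to long form, and map the values to a blue-to-red fill. The data are invented, and the use of ggplot2 with `scale_fill_gradient2()` is an assumption about the plotting approach, not a description of the original plot.

```r
# Toy data; only numeric columns can enter the correlation matrix.
students <- data.frame(
  G1       = c(10, 12, 8, 15, 6),
  G2       = c(11, 12, 7, 16, 5),
  G3       = c(11, 13, 6, 16, 4),
  absences = c(2, 0, 10, 1, 14),
  school   = c("GP", "GP", "MS", "GP", "MS")
)

numeric_cols <- students[sapply(students, is.numeric)]

# Pairwise Pearson correlations between all numeric columns.
cor_mat <- cor(numeric_cols)

# Long format: one row per (Var1, Var2) pair, ready for a tile plot.
cor_long <- as.data.frame(as.table(cor_mat), responseName = "correlation")

# Draw the heatmap only when ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  heatmap_plot <- ggplot2::ggplot(
    cor_long,
    ggplot2::aes(x = Var1, y = Var2, fill = correlation)
  ) +
    ggplot2::geom_tile() +
    ggplot2::scale_fill_gradient2(
      low = "blue", mid = "white", high = "red",
      midpoint = 0, limits = c(-1, 1)
    )
}
```

Factor columns such as `school` are excluded here, which is why they have to be brought in separately as a third degree of comparison.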
Intuitive relationships:
Some relationships to look into more:
Features not in the correlation heatmap: (Potential 3rd Degrees of Comparison)
| Factor | Level | Factor Avg | Difference |
|---|---|---|---|
| school | GP | 10.49 | 0.11 |
| school | MS | 9.62 | -0.76 |
| sex | F | 9.97 | -0.41 |
| sex | M | 10.83 | 0.45 |
| address | R | 9.51 | -0.87 |
| address | U | 10.63 | 0.25 |
| fam_size | GT3 | 10.13 | -0.25 |
| fam_size | LE3 | 11.00 | 0.62 |
| parental_stat | A | 11.20 | 0.81 |
| parental_stat | T | 10.29 | -0.09 |
| mom_edu | 2 | 9.73 | -0.65 |
| mom_edu | 3 | 10.30 | -0.08 |
| dad_edu | 2 | 10.26 | -0.12 |
| dad_edu | 3 | 10.66 | 0.28 |
| dad_edu | 4 | 11.19 | 0.81 |
| mom_job | other | 9.82 | -0.56 |
| mom_job | services | 11.02 | 0.64 |
| mom_job | teacher | 10.79 | 0.41 |
| dad_job | at_home | 10.00 | -0.38 |
| dad_job | other | 10.18 | -0.20 |
| dad_job | services | 10.24 | -0.14 |
| attend_reason | course | 9.78 | -0.60 |
| attend_reason | home | 10.26 | -0.12 |
| attend_reason | other | 11.17 | 0.79 |
| attend_reason | reputation | 11.07 | 0.68 |
| guardian | father | 10.69 | 0.31 |
| guardian | mother | 10.43 | 0.05 |
| school_support | no | 10.52 | 0.14 |
| school_support | yes | 9.43 | -0.95 |
| family_support | no | 10.55 | 0.17 |
| family_support | yes | 10.27 | -0.11 |
| extra_paid_classes | no | 9.93 | -0.45 |
| extra_paid_classes | yes | 10.92 | 0.54 |
| activities | no | 10.27 | -0.11 |
| activities | yes | 10.49 | 0.11 |
| nursery_school | no | 9.81 | -0.57 |
| nursery_school | yes | 10.54 | 0.15 |
| pursue_higher_edu | yes | 10.57 | 0.19 |
| internet_use | yes | 10.62 | 0.24 |
| romantic | no | 10.78 | 0.40 |
| romantic | yes | 9.58 | -0.81 |
| family_relationship | 1 | 10.62 | 0.24 |
| family_relationship | 2 | 9.89 | -0.49 |
| family_relationship | 3 | 10.04 | -0.34 |
| family_relationship | 4 | 10.36 | -0.02 |
| family_relationship | 5 | 10.69 | 0.31 |
| free_time | 1 | 9.84 | -0.54 |
| free_time | 3 | 9.71 | -0.67 |
| free_time | 4 | 10.43 | 0.05 |
| free_time | 5 | 11.30 | 0.92 |
| go_out_w_friends | 1 | 9.87 | -0.51 |
| go_out_w_friends | 2 | 11.04 | 0.66 |
| go_out_w_friends | 3 | 10.96 | 0.58 |
| go_out_w_friends | 4 | 9.65 | -0.73 |
| workday_alcohol | 1 | 10.68 | 0.30 |
| workday_alcohol | 3 | 10.50 | 0.12 |
| workday_alcohol | 4 | 9.89 | -0.49 |
| workday_alcohol | 5 | 10.67 | 0.29 |
| weekend_alcohol | 1 | 10.74 | 0.35 |
| weekend_alcohol | 2 | 9.94 | -0.44 |
| weekend_alcohol | 3 | 10.72 | 0.34 |
| weekend_alcohol | 4 | 9.69 | -0.69 |
| weekend_alcohol | 5 | 10.14 | -0.24 |
| health | 2 | 10.22 | -0.16 |
| health | 3 | 10.01 | -0.37 |
| health | 4 | 9.93 | -0.45 |
| health | 5 | 10.40 | 0.02 |

| Factor | Level | Factor Avg | Difference |
|---|---|---|---|
| mom_edu | 0 | 13.00 | 2.62 |
| mom_edu | 1 | 8.68 | -1.70 |
| dad_edu | 0 | 13.00 | 2.62 |
| mom_job | health | 12.15 | 1.77 |
| dad_job | teacher | 11.97 | 1.58 |
| pursue_higher_edu | no | 6.80 | -3.58 |
After calculating the overall average final grade, we ran every factor column through a custom function that compares the average final grade of students at each factor level against that overall average.
We then combined the results with bind_rows(). Levels whose averages were within 1 point of the overall mean were considered to have minimal individual impact, while those that differed by ≥ 1.5 points were classified as having significant individual impact.
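The procedure above can be sketched roughly as follows. The helper name `level_averages`, the toy data, and the "unclassified" label for differences between 1 and 1.5 points are all hypothetical; the write-up does not show the original function or say how that middle range was handled.

```r
library(dplyr)

# Toy data; column names follow the tables above, values are invented.
students <- tibble(
  G3       = c(10, 11, 9, 12, 8, 10),
  sex      = factor(c("F", "M", "F", "M", "F", "M")),
  romantic = factor(c("no", "no", "yes", "no", "yes", "no"))
)

overall_avg <- mean(students$G3)

# Hypothetical helper: average final grade per level of one factor column,
# plus the difference between that average and the overall average.
level_averages <- function(data, factor_name, overall) {
  data |>
    group_by(level = as.character(.data[[factor_name]])) |>
    summarise(factor_avg = mean(G3), .groups = "drop") |>
    mutate(factor = factor_name, difference = factor_avg - overall) |>
    select(factor, level, factor_avg, difference)
}

factor_cols <- names(students)[sapply(students, is.factor)]

# One result table per factor column, stacked with bind_rows().
impact <- bind_rows(
  lapply(factor_cols, level_averages, data = students, overall = overall_avg)
) |>
  mutate(impact = case_when(
    abs(difference) < 1    ~ "minimal",
    abs(difference) >= 1.5 ~ "significant",
    TRUE                   ~ "unclassified"  # 1 to 1.5 points: not defined in the text
  ))
```

With this sign convention a positive difference means the level averages above the overall mean, matching the Difference column in the tables.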
Left: Students’ average final grades increase with a higher combined parental education level.
Right: There is no clear relationship between final grades and either family relationship quality or family educational support.
Across all buckets of student study time, average final grades are marginally better for students with internet access at home.
Top: Shows the distribution of student grades across the weekly study time categories for each academic period. The red text denotes the number of failures in each period. Students in groups 3 and 4 have moderately higher mean grades across all periods but still have failures in the final period, illustrating that studying alone does not prevent poor grades.
Bottom Left: Shows the grade trend for each of the 4 study time groups throughout the school year. The sharpest decline is in group 4 from term 2 to the final, suggesting either burnout or poor information retention from studying more than 10 hours a week. The consistent decline across all 4 groups from term 1 to the final indicates that the averages are being dragged down by failing students.
Bottom Right: Excludes failed students from average grade calculations. Now, all four groups show a consistent positive trend in averages with higher study times being associated with higher averages.
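The effect of excluding failed students can be illustrated with a small sketch. The column name `study_time`, the invented grades, and the `fail_cutoff` of 10 (a final grade below 10 on the 0 to 20 scale counts as a failure) are assumptions; the write-up does not state how failures were defined.

```r
# Toy data: two students per study time group, values invented.
students <- data.frame(
  study_time = factor(c(1, 1, 2, 2, 3, 3, 4, 4)),
  G3         = c(0, 11, 10, 0, 13, 14, 0, 15)
)

# Assumed definition of failure: final grade below this cutoff.
fail_cutoff <- 10

# Average final grade per study time group, all students included.
avg_all <- aggregate(G3 ~ study_time, data = students, FUN = mean)

# Same averages after dropping students below the cutoff.
passed <- students[students$G3 >= fail_cutoff, ]
avg_passed <- aggregate(G3 ~ study_time, data = passed, FUN = mean)
```

Comparing `avg_all` with `avg_passed` shows how a few very low grades can pull a group's mean down even when most of its students score well.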
Shamim, A. (n.d.). Math-Students Performance Data [Data set]. Kaggle. https://www.kaggle.com/datasets/adilshamim8/math-students/data
Dua, D., & Graff, C. (2017). Student Performance Data Set. UCI Machine Learning Repository. https://archive.ics.uci.edu/dataset/320/student+performance
Cortez, P., & Silva, A. (2008). Using Data Mining to Predict Secondary School Student Performance. In A. Brito & J. Teixeira (Eds.), Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) (pp. 5-12). Porto, Portugal, April 2008. EUROSIS. ISBN 978-9077381-39-7.